An Analysis of the Relationship between Income, Education Level, and Capital Gain

Roman Shrestha

December 9, 2022

Motivation

In September 2019, the Census Bureau reported that income inequality in the United States had reached its highest level in 50 years.

Research Question

In the United States, is a person’s income related to their educational background and capital gain?

Hypothesis

As a person’s education level and capital gain increase, their income will also increase.

Introduction to Data

The dataset we use was extracted by Barry Becker from the US Census Bureau’s 1994 database and was later donated to the Machine Learning Repository of the University of California, Irvine, as the Census Income Data Set.

  • 32,561 instances

  • 15 attributes, including a person’s education level, race, and capital gain

  • mix of continuous and discrete data

  • key explanatory variables: education, education_num, and capital_gain

  • outcome variable: income

  • control variables: marital status, occupation, race, and sex

# A tibble: 6 × 16
    age workclass     fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race  sex  
  <dbl> <chr>          <dbl> <fct>     <dbl> <fct>   <chr>   <chr>   <fct> <fct>
1    50 Self-emp-not…  83311 bachel…      13 Marrie… Exec-m… Husband White Male 
2    38 Private       215646 high-s…       9 Divorc… Handle… Not-in… White Male 
3    53 Private       234721 some-h…       7 Marrie… Handle… Husband Black Male 
4    28 Private       338409 bachel…      13 Marrie… Prof-s… Wife    Black Fema…
5    37 Private       284582 masters      14 Marrie… Exec-m… Wife    White Fema…
6    49 Private       160187 some-h…       5 Marrie… Other-… Not-in… Black Fema…
# … with 6 more variables: capital_gain <dbl>, capital_loss <dbl>,
#   hours_per_week <dbl>, native_country <fct>, income <fct>, income_bin <dbl>,
#   and abbreviated variable names ¹​education, ²​education_num, ³​marital_status,
#   ⁴​occupation, ⁵​relationship
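
The income_bin column in the preview looks like a numeric version of income, which is what a linear model needs as an outcome. The slides do not say how it was coded, so the following is only a minimal Python sketch under the assumption that income_bin is 1 for >50K and 0 for <=50K. The original analysis appears to have been done in R, and the function name income_to_bin is hypothetical.

```python
def income_to_bin(income: str) -> int:
    """Map an income label to a 0/1 indicator (assumed coding: 1 means >50K)."""
    label = income.strip()
    if label == ">50K":
        return 1
    if label == "<=50K":
        return 0
    raise ValueError(f"unexpected income label: {income!r}")


# Illustrative labels only, not rows taken from the census data.
labels = ["<=50K", ">50K", "<=50K"]
income_bin = [income_to_bin(x) for x in labels]
print(income_bin)
```

Raising on an unknown label, rather than silently returning 0, keeps a mistyped or missing category from being counted as low income.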

Exploratory Data Analysis: Plots

Histogram of Education and Income

Bar plot of Education and Income

Density plot of Capital gain vs Income

Residual plot

Exploratory Data Analysis: Statistics

Summary statistics

Table 2
Education                   Income  Education Years        Capital Gain
                            Group     Mean  Median     SD       Mean         SD
some_primary_middle_school  <=50K    3.301       4  0.875    172.493   1413.253
some_primary_middle_school  >50K     3.548       4  0.670   1302.468   2718.439
some-hs-school              <=50K    6.497       7  0.932    108.069    854.967
some-hs-school              >50K     6.544       7  0.955   3398.736  13155.117
high-school-grad            <=50K    9.000       9  0.000    153.879    986.443
high-school-grad            >50K     9.000       9  0.000   2805.281  12044.407
some-college                <=50K   10.000      10  0.000    125.685    825.845
some-college                >50K    10.000      10  0.000   2612.823  10605.908
associate                   <=50K   11.440      11  0.497    158.799    776.956
associate                   >50K    11.423      11  0.494   2207.693   7035.267
bachelors                   <=50K   13.000      13  0.000    162.260    837.939
bachelors                   >50K    13.000      13  0.000   4004.705  14024.036
masters                     <=50K   14.000      14  0.000    285.775   1792.404
masters                     >50K    14.000      14  0.000   4376.397  14284.895
prof-school                 <=50K   15.000      15  0.000    186.954    751.534
prof-school                 >50K    15.000      15  0.000  14113.712  30740.170
doctorate                   <=50K   16.000      16  0.000    220.486    899.842
doctorate                   >50K    16.000      16  0.000   6361.039  19783.396
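
Table 2 groups the data by education and income group and reports the mean, median, and standard deviation within each group. The grouping can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the table: the records are hypothetical toy values, and the dictionary keys are names chosen here.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical toy records: (education, income, education_num, capital_gain).
# These values are made up for illustration and are not rows from the census data.
records = [
    ("bachelors", "<=50K", 13, 0),
    ("bachelors", "<=50K", 13, 200),
    ("bachelors", ">50K", 13, 0),
    ("bachelors", ">50K", 13, 5000),
    ("associate", "<=50K", 11, 0),
    ("associate", "<=50K", 12, 100),
]

# Collect the numeric columns for each (education, income) pair.
groups = defaultdict(list)
for education, income, edu_num, gain in records:
    groups[(education, income)].append((edu_num, gain))

summary = {}
for key, rows in groups.items():
    years = [r[0] for r in rows]
    gains = [r[1] for r in rows]
    summary[key] = {
        "mean_years": mean(years),
        "median_years": median(years),
        "sd_years": stdev(years),  # sample SD (n - 1 denominator)
        "mean_gain": mean(gains),
        "sd_gain": stdev(gains),
    }

for key, stats in summary.items():
    print(key, stats)
```

A group in which every person has the same education_num gets a standard deviation of exactly 0, which matches the 0.000 entries in Table 2 for levels such as bachelors and masters.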

Linear Model and Hypothesis Testing

Linear Model

  • Each additional dollar of capital gain is associated with a higher probability of income above $50K.

  • Each education level has a positive coefficient, so more education is associated with a higher probability of income above $50K. Only the coefficients for some-hs-school and high-school-grad are not statistically significant at the 5% level.

  • The slopes for all of our explanatory variables are positive, which suggests a positive linear association between the explanatory variables and the outcome variable.

  • However, the residual plot shows that the relationship between our variables is not linear.

  • A model other than linear regression may explain the relationship better.

Table 3
Term                         Estimate  Std Error   Statistic    P Value
(Intercept)                -0.1616961  0.0223596  -7.2316260  0.0000000
capital_gain                0.0000083  0.0000003  30.7411302  0.0000000
educationsome-hs-school     0.0153229  0.0211389   0.7248664  0.4685392
educationhigh-school-grad   0.0319746  0.0328280   0.9740051  0.3300613
educationsome-college       0.0797344  0.0381008   2.0927212  0.0363818
educationassociate          0.0919037  0.0460746   1.9946700  0.0460872
educationbachelors          0.2077272  0.0541460   3.8364290  0.0001251
educationmasters            0.3052909  0.0599616   5.0914368  0.0000004
educationprof-school        0.3557377  0.0664992   5.3495011  0.0000001
educationdoctorate          0.4070083  0.0723654   5.6243496  0.0000000
education_num               0.0138270  0.0055156   2.5068732  0.0121853
education_num 0.0138270 0.0055156 2.5068732 0.0121853

Hypothesis Testing

  • Null hypothesis: there is no association between income and education

  • Alternative hypothesis: there is an association between income and education

  • We use a chi-squared test of independence to test the hypothesis

  • The chi-squared statistic is about 4427, and the p-value is negligible

  • We reject the null hypothesis

  • There is an association between income and education level

X-squared 
 4426.764 
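
The chi-squared statistic compares the observed counts in an education-by-income table with the counts expected if the two variables were independent. The Python sketch below computes the Pearson statistic for a small table of hypothetical counts. It is not the code that produced the value above, and the function name chi_squared_statistic is chosen here for illustration.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are three education groups,
# columns are the <=50K and >50K income groups.
toy_counts = [
    [90, 10],
    [60, 40],
    [30, 70],
]
statistic, dof = chi_squared_statistic(toy_counts)
print(statistic, dof)
```

The larger the statistic relative to its degrees of freedom, the stronger the evidence against independence. If the test above used the nine education levels and two income groups shown in Table 2, it would have (9 - 1) × (2 - 1) = 8 degrees of freedom.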

Conclusion and Discussion

  • Our statistical analysis supports the hypothesis that as a person’s education level and capital gain increase, their income also increases.

  • The exploratory analysis, linear regression, and chi-squared test all point to an association between our variables, and the positive regression slopes suggest that higher education and capital gain predict higher income.

  • This research provides only a glimpse into the factors that influenced a person’s income in 1994, nearly three decades ago, and does not necessarily apply to the present.

  • We did not have the coding experience to fit the non-linear models that the residual plot suggests are needed, as noted in the regression analysis.

  • We encourage future research on the factors that influence a person’s (in)accessibility to education, since our results show that education is a key indicator of income.